[ET-VK][q8ta_pixel_shuffle] Add fused PixelShuffle custom op for channels-packed int8 tensors by pytorchbot · Pull Request #19439 · pytorch/executorch

pytorchbot · 2026-05-09T04:58:14Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #19397 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/528/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/528/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/531/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/528/orig
Differential Revision: D104099055
@diff-train-skip-merge

…nels-packed int8 tensors

Pull Request resolved: #19397

A RefineNet segmentation model spends ~860 us (~17% of inference) on the textbook decomposed PyTorch PixelShuffle chain (q8ta_dequantize -> view -> permute -> view -> q8ta_quantize) repeated four times in the segmentation head. This is wasteful: it materializes three buffers and round-trips through fp32 just to perform what is fundamentally a byte permutation on an int8 tensor.

This diff introduces et_vk.q8ta_pixel_shuffle.default, a single fused kernel that operates directly on int8x4 packed buffers. Each thread writes one output int32 word (= 4 consecutive output channels at one (n, oh, ow) spatial position). Dispatch is 1D over total output int words, sized as N * div_up_4(C_out) * H_out * W_out with a 64-thread local workgroup.

The four channel lanes inside an output int come from four different input int words (input channels are spaced by r*r), so each thread issues four input loads. The (oh % r, ow % r) -> input lane mapping is constant for a given thread because all four output lanes share (oh, ow). The first byte index is computed via the layout-aware helper tensor4d_idx_to_buf_idx; subsequent lanes derive their byte index by adding stride[packed_dim] * block_numel, a layout-only constant, so only one helper call is needed per thread.

When input/output share scale and zero-point (the typical residual-path case), the requantize math is skipped and the kernel becomes a pure byte shuffle (selected via the passthrough push constant).

The op accepts the channels-packed PACKED_INT8 family (PACKED_INT8_4W4C, PACKED_INT8_4C1W, PACKED_INT8_CONV2D) on both input and output. The partitioner routes the op into whichever channels-packed layout the surrounding q8ta_conv2d_pw / q8ta_add ops produce/consume (PACKED_INT8_4W4C on RefineNet).
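The per-thread indexing described above can be modeled in plain Python. This is an illustrative sketch, not the GLSL shader. Nested lists stand in for the packed buffer. The decode order of the linear word index (ow fastest) and the little-endian lane packing are assumptions. The requantize formula is a generic affine requantize whose rounding may differ from the kernel. All function names here are hypothetical.

```python
def div_up_4(x):
    return (x + 3) // 4


def num_output_words(n, c_out, h_out, w_out):
    # 1D dispatch size quoted in the description above.
    return n * div_up_4(c_out) * h_out * w_out


def pixel_shuffle_reference(x, r):
    """Reference PixelShuffle on a nested list shaped [N][C_in][H][W]."""
    n_, c_in, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    c_out = c_in // (r * r)
    return [[[[x[n][c * r * r + (oh % r) * r + (ow % r)][oh // r][ow // r]
               for ow in range(w * r)]
              for oh in range(h * r)]
             for c in range(c_out)]
            for n in range(n_)]


def requantize(q, scale_in, zp_in, scale_out, zp_out):
    # Assumed affine requantize; the real shader's rounding mode may differ.
    v = round((q - zp_in) * scale_in / scale_out) + zp_out
    return max(-128, min(127, v))


def fused_output_word(x, r, word_idx, qparams=None):
    """Model of one thread: build one int32 word holding 4 output channels.

    qparams=None models the passthrough case (pure byte shuffle).
    """
    c_in, h, w = len(x[0]), len(x[0][0]), len(x[0][0][0])
    c_out, h_out, w_out = c_in // (r * r), h * r, w * r
    c4_count = div_up_4(c_out)
    # Hypothetical decode order; the real kernel follows the tensor layout.
    ow = word_idx % w_out
    oh = (word_idx // w_out) % h_out
    c4 = (word_idx // (w_out * h_out)) % c4_count
    n = word_idx // (w_out * h_out * c4_count)
    sub = (oh % r) * r + (ow % r)  # shared by all four lanes of this word
    word = 0
    for lane in range(4):
        oc = c4 * 4 + lane
        if oc >= c_out:
            break  # padding lanes stay zero
        # Consecutive output channels read input channels spaced by r*r.
        q = x[n][oc * r * r + sub][oh // r][ow // r]
        if qparams is not None:
            q = requantize(q, *qparams)
        word |= (q & 0xFF) << (8 * lane)
    return word


def unpack_lane(word, lane):
    b = (word >> (8 * lane)) & 0xFF
    return b - 256 if b >= 128 else b
```

Comparing `unpack_lane(fused_output_word(x, r, i), lane)` against `pixel_shuffle_reference(x, r)` for every word index `i` is one way to check that the fused mapping matches the decomposed view -> permute -> view chain.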
Restricting to the channels-packed family means the inner block axis is always C and the lane within an int word is constant per thread, which removes the need for layout-block-config spec consts in the shader.

Rather than matching the decomposed view -> permute -> view chain after to_edge lowering, this diff preserves aten.pixel_shuffle.default through to_edge by adding it to the partitioner's ops_to_not_decompose list. The matcher then operates on the much simpler dq -> [clone] -> aten.pixel_shuffle.default -> [clone] -> q form. This keeps the matcher robust against edge-dialect / clone-insertion variations.

Pieces in this diff:

- Partitioner / fuser:
  - partitioner/vulkan_partitioner.py: adds aten.pixel_shuffle.default to ops_to_not_decompose so the framework preserves the op through to_edge lowering.
  - patterns/quantized_pixel_shuffle.py: detects dq -> [clone] -> aten.pixel_shuffle.default -> [clone] -> q and rewrites it to et_vk.q8ta_pixel_shuffle.default. Transparently skips clone / _clone_dim_order nodes between any pair of nodes.
- Runtime kernel:
  - runtime/graph/ops/glsl/q8ta_pixel_shuffle.glsl + .yaml
  - runtime/graph/ops/impl/Q8taPixelShuffle.cpp + .h
- Op definitions:
  - custom_ops_lib.py: register et_vk.q8ta_pixel_shuffle (Python op definition).
  - op_registry.py: inputs_storage = utils.PACKED_INT8_CHANNELS_PACKED_BUFFER.
- Tests:
  - test/custom_ops/impl/TestQ8taPixelShuffle.cpp: test op that runs q -> [fused | unfused chain] -> dq, with selectable input/output int8 layouts via str args. The op accepts the channels-packed family; the layout_from_string helper currently exercises 4W4C.
  - test/custom_ops/test_q8ta_pixel_shuffle.cpp: 16 ACCU + 8 PERF cases (4 shapes x 2 qparam settings x 2 impl_selectors x 1 layout combination, 4W4C -> 4W4C).
  - test/test_vulkan_passes.py: positive and negative pattern-matcher unit tests against the un-decomposed form.
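The dq -> [clone] -> aten.pixel_shuffle.default -> [clone] -> q match with transparent clone skipping can be sketched as below. This is a minimal model under stated assumptions, not the code in patterns/quantized_pixel_shuffle.py. The `Node` class is a hypothetical single-input stand-in for a graph node, and the op strings "dq", "q", "clone" and "_clone_dim_order" are placeholders for the real edge-dialect targets.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Placeholder op names; the real matcher compares against graph node targets.
CLONE_OPS = {"clone", "_clone_dim_order"}
PIXEL_SHUFFLE_OP = "aten.pixel_shuffle.default"


@dataclass
class Node:
    op: str
    src: Optional["Node"] = None  # single producer, enough for this chain


def skip_clones(node: Optional[Node]) -> Optional[Node]:
    # Walk upward past any number of clone-like nodes.
    while node is not None and node.op in CLONE_OPS:
        node = node.src
    return node


def match_quantized_pixel_shuffle(
    q_node: Node,
) -> Optional[Tuple[Node, Node, Node]]:
    """Return (dq, pixel_shuffle, q) if the chain ending at q_node matches."""
    if q_node.op != "q":
        return None
    shuffle = skip_clones(q_node.src)
    if shuffle is None or shuffle.op != PIXEL_SHUFFLE_OP:
        return None
    dq = skip_clones(shuffle.src)
    if dq is None or dq.op != "dq":
        return None
    return dq, shuffle, q_node
```

Anchoring the match at the quantize node and walking toward the producer keeps the clone handling in one helper, so extra clone nodes on either side of the pixel shuffle do not need separate pattern variants.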
ghstack-source-id: 379519848
@exported-using-ghexport

Differential Revision: [D104099055](https://our.internmc.facebook.com/intern/diff/D104099055/)

pytorch-bot · 2026-05-09T04:58:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19439

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

Long GPU queue (g5, g6) on LF fleet

❌ 1 New Failure, 46 Pending

As of commit 9404eb5 with merge base c564936:

NEW FAILURE - The following job has failed:

pull / unittest-wasm-bindings (--enable-etdump) / linux-job (gh)
curl: (22) The requested URL returned error:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorchbot requested a review from SS-JIA as a code owner May 9, 2026 04:58

meta-cla Bot added the CLA Signed label May 9, 2026

SS-JIA approved these changes May 9, 2026

SS-JIA merged commit 117ffb4 into gh/SS-JIA/531/orig May 9, 2026
171 of 178 checks passed

SS-JIA deleted the gh/SS-JIA/528/orig branch May 9, 2026 05:38

SS-JIA merged 1 commit into gh/SS-JIA/531/orig from gh/SS-JIA/528/orig
